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ABSTRACT 

Runtime  time  preprocessing  plays  a  major  role  in  many  efficient  algorithms  in  computer 
science,  as  well  as  playing  an  important  role  in  exploiting  multiprocessor  architectures.  We 
give  examples  that  elucidate  the  importance  of  run  time  preprocessing  and  show  how  these 
optimizations  can  be  integrated  into  compilers.  To  support  our  arguments,  we  describe 
transformations  implemented  in  prototype  multiprocessor  compilers  and  present  benchmarks 
from  the  iPSC2/860,  the  CM-2,  and  the  Encore  Multimax/320. 
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1  Introduction 

1.1  Overview 


In  many  algorithms,  data  produced  or  input  during  a  program’s  initialization  plays  a  large 
role  in  determining  the  nature  of  the  subsequent  computation.  When  the  data  structures 
that  define  a  computation  have  been  initialized,  a  preprocessing  phase  follows.  Vital  elements 
of  the  strategy  used  by  the  rest  of  the  algorithm  are  determined  by  this  preprocessing  phase. 

To  effectively  exploit  many  multiprocessor  architectures,  we  may  also  have  to  carry  out 
run  time  preprocessing.  This  preprocessing  will  be  referred  to  as  runtime  compilation.  The 
purpose  of  runtime  compilation  is  not  to  determine  which  computations  are  to  be  performed 
but  instead  to  determine  how  a  multiprocessor  machine  will  schedule  the  algorithm’s  work, 
how  to  map  the  data  structures  and  how  data  movement  within  the  multiprocessor  is  to  be 
scheduled.  In  this  paper,  we  specifically  address  problems  for  which  computational  patterns 
can  be  predicted  when  values  assigned  to  key  data  structures  are  known.  These  problems 
include  computations  on  non-uniform  meshes,  sparse  direct  factorization  which  does  not 
involve  pivoting  and  sparse  iterative  linear  solvers.  (^(<X  j  C  - * 

Values  obtained  during  program  execution  can  affect  the  nature  and  degree  of  potential 
concurrency.  Runtime  compilation  may  be  needed  to  identify  and  exploit  concurrency.  Com¬ 
plex  heterogeneous  memory  hierarchies  characterize  virtually  all  multiprocessor  architectures 
with  more  than  a  few  dozen  processors.  Primary  memory  is  divided  among  processors.  To 
obtain  data  from  other  portions  of  the  primary  memory  of  the  multiprocessor,  we  typically 
need  to  access  a  communications  network.  Program  performance  can  be  dramatically  affected 
by  the  scheduling  of  data  movement  among  processors. 

There  has  been  much  research  carried  out  on  methods  for  runtime  parallelization  as  well 
as  runtime  workload  and  data  partitioning.  Most  parallelization  and  problem  partitioning 
methods  explicitly  or  implicitly  specify  patterns  of  interprocessor  communication.  When 
patterns  of  computation  are  determined  by  data  structures  initialized  during  program  ex¬ 
ecution,  traditional  compiler  techniques  cannot  possibly  carry  out  these  partitioning  and 
scheduling  operations.  Only  recently  have  methods  been  developed  that  can  integrate  the 
kinds  of  runtime  optimizations  mentioned  above  into  compilers  and  programming  environ¬ 
ments. 

2  Algorithmic  Execution  Time  Preprocessing 

In  many  efficient  approaches  to  solving  problems  in  computing,  data  produced  or  input  dur¬ 
ing  program  execution  plays  a  large  role  in  determining  computational  patterns.  Examples 
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include: 


•  Most  searching  and  sorting  problems 

•  Critical  path  analysis 

•  Game  tree  and  decision  tree  manipulations 

•  Direct  and  iterative  sparse  linear  system  solvers 

Once  an  appropriate  subset  of  the  input  (or  generated)  data  is  available,  it  is  frequently 
worthwhile  to  perform  some  preprocessing.  This  preprocessing  can  take  many  forms,  but 
results  of  the  preprocessing  determine  vital  elements  of  the  strategy  used  by  the  remainder 
of  the  algorithm.  A  simple  example  of  this  is  the  method  of  interpolated  binary  search. 
The  number  of  computations  required  for  a  simple  binary  search  of  a  sorted  list  depends  on 
the  values  of  the  elements  in  the  list  and  on  the  value  of  the  key.  We  can  preprocess  the 
sorted  list  and  use  the  distribution  of  element  values  in  the  list  to  produce  an  interpolation 
function  that  is  used  to  direct  the  search.  It  is  frequently  possible  to  amortize  the  cost  of 
preprocessing.  In  the  interpolated  binary  search  example,  once  preprocessing  is  carried  out, 
we  can  used  the  interpolation  function  to  search  for  a  sequence  of  different  keys. 

Some  other  examples  of  well  known  algorithms  that  carry  out  preprocessing  to  determine 
vital  elements  of  the  strategy  used  by  the  remainder  of  the  algorithm  are: 

a.  Creation  of  indices  to  speed  database  retrieval  where  indices  are  created  to  allow  the 
use  of  efficient  search  methods  on  many  different  database  keys  [30]. 

b.  Generation  of  threaded  binary  search  trees  where  extra  links  are  added  to  a  binary 
tree  to  speed  tree  traversal  [16]. 

c.  Matrix  reordering  and  symbolic  factorization  used  in  sparse  direct  linear  equation 
solvers.  In  such  problems,  the  number  and  pattern  of  computations  in  a  sparse  matrix 
factorization  is  determined  by  the  order  in  which  steps  in  the  factorization  are  carried 
out.  In  many  cases  it  is  possible  to  use  the  non-zero  structure  of  a  matrix  to  predeter¬ 
mine  the  order  in  which  computations  will  be  carried  out  and  to  allocate  the  memory 
needed  to  store  the  resulting  factored  matrix  [12]. 

In  each  of  these  examples,  the  results  of  a  single  preprocessing  computation  can  be  used  to 
solve  any  member  of  a  class  of  structurally  similar  problems.  In  the  database  example,  the 
creation  of  an  index  can  be  followed  by  an  arbitrary  number  of  queries.  Once  a  threaded 
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binary  search  tree  is  generated,  the  resulting  data  structure  can  be  used  in  an  arbitrary  num¬ 
ber  of  tree  traversals.  A  symbolic  matrix  factorization  can  be  used  to  speed  the  factorization 
of  any  matrix  with  a  given  pattern  of  non-zero  entries. 

Runtime  compilation  techniques  attempt  to  discover  how  to  maximize  the  performance 
of  algorithms  on  multiprocessors.  Since  these  methods  are  particularly  useful  in  algorithms 
whose  computational  patterns  depend  on  values  assigned  to  data  structures  during  program 
execution,  a  significant  preprocessing  cost  is  frequently  involved.  In  runtime  compilation,  we 
are  also  often  able  to  amortize  costs  of  preprocessing  among  a  number  of  structurally  similar 
computational  phases. 


3  Run-Time  Parallelization 

Run-time  parallelization  is  perhaps  the  most  obvious  form  of  multiprocessor  runtime  compila¬ 
tion.  Parallelization  carried  out  during  compilation  is  necessarily  conservative.  If  a  compiler 
cannot  figure  out  how  to  generate  a  correct  parallelizing  loop  transformation,  loop  iterations 
have  to  be  performed  sequentially.  Many  loop  nests  defy  compile-time  parallelization  be¬ 
cause  dependency  patterns  are  determined  by  variables  or  arrays  initialized  during  program 
execution.  One  way  of  carrying  out  runtime  parallelization  is  to  analyze  the  inter- iteration 
dependency  pattern  in  a  loop  nest  to  identify  wavefronts  of  concurrently  executable  loop  iter¬ 
ations.  Using  a  form  of  run  time  preprocessing,  we  transform  a  loop  nest  with  inter-iteration 
dependencies  into  a  sequence  of  parallel  loops.  Execution  time  preprocessing  is  frequently 
used  to  parallelize  sparse  numerical  algorithms,  such  as  those  arising  in  sparse  direct  and 
iterative  linear  solvers  [2],  [4],  [24],  [11],  [1], 

Typically,  programmers  need  to  explicitly  code  the  procedures  that  carry  out  the  nec¬ 
essary  run  time  preprocessing.  It  is  possible  to  produce  a  runtime  parallelization  program 
transformation  that  generates  code  designed  to  perform  run-time  loop  parallelization  [28]. 
The  compiler  transforms  a  loop  into  two  separate  code  segments.  The  first  code  segment, 
the  inspector,  finds  sets  of  independent  loop  iterations  while  the  second  code  segment,  the 
executor ,  carries  out  the  scheduled  work.  Runtime  parallelization  transformations  have  been 
implemented  in  a  prototype  compiler  targeted  at  shared  memory  machines  [29].  Runtime 
compilation  only  handles  a  subset  of  the  possible  types  of  runtime  parallelization.  Our  trans¬ 
formations  only  apply  to  loop  nests  in  which  inter-iteration  dependencies  do  not  depend  on 
the  results  of  computations  carried  out  within  the  loop  nest.  There  are  a  number  of  al¬ 
gorithms  that  merge  the  process  of  identifying  and  performing  concurrent  work.  It  seems 
likely  to  us  that  it  will  be  possible  to  produce  compiler  transformations  that  generate  hybrid 
inspector/  executors  for  more  fully  dynamic  algorithms,  but  we  will  not  address  this  issue 
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do  i=l,N 


y(i)  =  a(i)+y(ia(i))  +  b(i)*y(ib(i)) 
end  do 


Figure  1:  Sequential  Code  to  be  Parallelized 

further  in  this  paper. 

To  clarify  the  scheme,  we  now  present  a  simple  example.  A  simple  sequential  program 
is  presented  in  Figure  1.  Note  that  right  hand  side  references  to  array  y  use  a  level  of 
indirection.  The  inspector  used  to  perform  runtime  parallelization  (Figure  2)  is  simply  a 
topological  sort.  This  sort  can  be  generated  from  the  parse  tree  produced  by  the  loop  in 
Figure  1.  The  inspector  in  Figure  2  is  sequential  but  can  be  parallelized  using  the  principles 
to  be  described  later  in  this  section.  Once  the  wavefront  corresponding  to  each  index  is 
known,  we  can  sort  the  indices  in  order  of  increasing  wavefront  number  to  produce  the  array 
schedule.  The  inspector  also  initializes  a  pointer  array  count.  Array  count  contains  the 
address  in  schedule  of  the  beginning  of  each  wavefront.  Loop  iterations  corresponding  to 
wavefront  i  are  found  in  schedule  between  count  (i)  and  count  (i+l)-l. 

The  executor  in  Figure  3  is  a  sequence  of  parallel  do  loops  that  run  over  consecutive 
wavefronts  obtained  by  the  inspector  from  the  sequential  code  in  Figure  1.  Note  that  to 
obtain  the  correct  solution  in  the  executor  we  need  to  maintain  two  copies  of  the  array  y 
found  in  the  sequential  code.  In  Figure  1  we  call  these  copies  y  and  ynew. 

In  evaluating  the  usefulness  of  run-time  parallelization,  the  cost  of  the  preprocessing  must 
be  taken  into  account.  In  [29]  we  present  timings  obtained  from  the  run-time  parallelization 
transformation  applied  to  sparse  lower  triangular  solves.  On  an  18  processor  Encore  Mul¬ 
timax/320,  a  single  processor  required  241  milliseconds  to  solve  a  lower  triangular  system 
obtained  from  an  incomplete  factorization  of  one  of  the  Boeing  Harwell  test  matrices.  On 
16  processors  of  the  Multimax,  the  inspector  required  100  milliseconds  and  the  executor  re¬ 
quired  23  milliseconds.  In  many  situations,  we  can  amortize  the  cost  of  an  inspector  because 
we  need  to  repeatedly  carry  out  a  given  pattern  of  computations.  For  instance,  in  iterative 
linear  systems  solvers  we  may  need  to  repeatedly  solve  the  same  sparse  triangular  systems 
with  different  right  hand  sides. 

A  variety  of  tradeoffs  can  be  made  between  the  costs  and  benefits  of  preprocessing.  We 
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wf(l:N)  =  0 
do  i=l,N 

wf(i)  =  max(wf(i),wf(ia(i)),wf(ib(i)))  +  1 
end  do 

Use  wf()  to  produce  scheduleQ,  a  list  of  indices  in  order  of  increasing  wavefront  number 


Figure  2:  Parallelizing  Inspector 


do  phase  =  1,  np 

parallel  do  i=count(phase),count(phase+l)-l 
ii  =  schedule(i) 

if(ia(ii).lt.ii)  then  tmpl  =  ynew(ia(ii)) 
else  tmpl  =  y(ia(ii))  endif 

if(ib(ii).lt.ii)  then  tmp2  =  ynew(ib(ii)) 
else  tmp2  =  y(ib(ii))  endif 

ynew(ii)  =  a(ii)*tmpl  +  b(ii)*tmp2 
end  parallel  do 

end  do 

y(l:n)  =  ynew(l:n) 

Figure  3:  Parallelizing  Executor 
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can  dispense  with  reordering  loop  iterations  into  concurrent  wavefronts  and  still  be  able  to 
exploit  parallelism  to  a  degree  by  using  a  preprocessed  doacross  transformation  [27].  In  a 
doacross  construct  [9],  loop  iterations  are  partitioned  between  processors  in  a  striped  fashion 
and  synchronization  calls  are  introduced  so  that  computations  from  some  loop  iterations  can 
be  overlapped.  Doacross  loops  typically  make  use  of  a-priori  knowledge  of  inter-iteration 
dependencies  to  carry  out  needed  inter-iteration  synchronizations.  It  is  possible  to  carry 
out  a  relatively  small  amount  of  run  time  preprocessing  and  postprocessing  that  eliminates 
the  need  for  a-priori  knowledge  of  dependencies.  On  machines  with  snooping  caches  (such 
as  the  Multimax/320),  it  is  efficient  to  synchronize  using  shared  arrays.  The  following  is  a 
sketch  of  some  of  the  transformations  involved  in  generating  preprocessed  doacross  loops,  a 
much  more  detailed  description  may  be  found  in  [27].  A  shared  array  ready  is  initialized 
to  NOTDONE.  When  a  left  hand  side  array  element  i  is  calculated,  ready  (i)  is  set  to  DONE. 
Processors  needing  to  use  an  updated  value  of  array  element  i  busy  wait  on  ready  (i)  until 
ready (i)  is  set  to  DONE.  In  preprocessed  doacross  loops ,  we  need  to  maintain  two  copies  of 
shared  arrays  that  appear  on  the  left  hand  side  of  expressions  during  the  computation.  After 
the  loop  is  completed,  the  two  shared  array  copies  need  to  be  reconciled. 

The  run  time  initialization  and  postprocessing  in  the  preprocessed  doacross  loop  are  rel¬ 
atively  inexpensive  compared  to  the  preprocessing  costs  incurred  by  a  parallelizing  inspector 
(e.g.  figure  2).  For  the  above  cited  lower  triangular  solve  involving  the  incompletely  factored 
Boeing  Harwell  test  matrix,  the  preprocessed  doacross  loop  requires  45  milliseconds.  This 
can  be  compared  to  the  23  milliseconds  required  to  carry  out  the  runtime  parallelized  solve 
and  the  100  millisecond  preprocessing  time  of  the  inspector. 

Runtime  parallelization  can  be  carried  out  on  a  variety  of  architectures.  In  this  paper, 
we  discuss  runtime  parallelization  only  in  the  context  of  shared  memory  architectures;  a 
discussion  of  runtime  parallelization  for  distributed  memory  machines  is  found  in  [26]. 

4  Runtime  Compilation  for  Distributed  Memory  Ma¬ 
chines 

4.1  Distributed  Memory  Inspectors  and  Executors 

In  distributed  memory  machines,  large  data  arrays  need  to  be  partitioned  between  local 
memories  of  processors.  These  partitioned  data  arrays  are  called  distributed  arrays.  We 
follow  the  usual  practice  of  assigning  long  term  storage  of  distributed  array  data  to  specific 
memory  locations  in  the  distributed  machine.  A  processor  that  needs  to  read  an  array 
element  must  fetch  a  copy  of  that  element  from  the  memory  of  the  processor  in  which  that 
array  element  is  stored.  Alternately,  a  processor  may  need  to  store  a  value  in  an  off-processor 
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Each  processor  P: 

—  preprocesses  its  own  loop  iterations 

—  Records  off-processor  fetches  and  stores  in  hashed  cache 

—  Finds  send/receive  calls  required  for  data  exchange 

(i)  P  generates  list  of  all  off-processor  data  to  be  fetched 

(ii)  Other  processors  tell  P  which  data  to  send 

(iii)  Send/Receive  pairs  generated  and  stored 


Figure  4:  Inspector  For  Parallel  Loop  on  Distributed  Memory  Multiprocessor 

distributed  array  element.  Local  copies  of  off-processor  distributed  array  elements  are  stored 
in  hash  tables  called  hashed  caches.  Run-time  procedures  carry  out  the  movement  of  data 
between  processors  and  manage  the  above  mentioned  hash  tables. 

In  distributed  memory  MIMD  architectures,  there  is  typically  a  non-trivial  communi¬ 
cations  latency  or  startup  cost  [7].  For  efficiency  reasons,  information  to  be  transmitted 
should  be  collected  into  relatively  large  messages.  The  cost  of  fetching  array  elements  can 
be  reduced  by  precomputing  what  data  each  processor  needs  to  send  and  to  receive. 

In  Figure  4,  we  outline  the  preprocessing  we  performed  to  implement  a  parallel  loop  on  a 
distributed  machine.  The  distribution  of  parallel  loop  indices  to  processors  determines  where 
computations  are  to  be  performed.  We  assume  that  all  needed  distributed  arrays  have  been 
defined  and  initialized  and  that  loop  iterations  have  been  partitioned  between  processors. 
Using  the  hashed  cache  to  record  off-processor  fetches  and  stores  allows  us  to  recognize  when 
more  than  one  reference  is  being  made  to  the  same  off-processor  distributed  array  element, 
so  that  only  one  copy  of  that  element  need  be  fetched  or  stored. 

During  our  inspector  phase,  we  carry  out  a  set  of  interprocessor  communications  that 
allows  us  to  anticipate  exactly  which  send  and  receive  communication  calls  each  processor 
must  execute  so  that  all  interprocessor  data  transmission  is  correctly  carried  out.  By  contrast, 
if  individual  fetches  and  stores  were  to  be  carried  out  during  the  actual  computation,  things 
would  be  much  more  awkward.  For  example,  in  such  a  case  processor  A  might  obtain  the 
contents  of  a  distributed  array  element  which  is  not  on  A  by  sending  a  message  to  processor 
B  associated  with  the  array  element.  Processor  B  would  be  programmed  to  anticipate  a 
request  of  this  type,  to  satisfy  the  request  and  to  return  a  responding  message  containing 
the  contents  of  the  specified  array  element. 
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•  Before  loop  or  code  segment 


(i)  Data  to  be  sent  off-processor  read  from  distributed  arrays 

(ii)  Send/receive  calls  transport  off-processor  data 

(iii)  Data  written  into  hashed  cache 

•  Computation  carried  out 

—  off-processor  reads /writes  go  to  hashed  cache 

•  At  end  of  loop  or  code  segment 

(i)  Data  to  be  stored  off-processor  read  from  hashed  cache 

(ii)  Send/receive  calls  transport  off-processor  data 

(iii)  Data  written  back  into  distributed  arrays  for  longer  term  storage 

Figure  5:  Executor  For  Parallel  Loop  on  Distributed  Memory  Multiprocessor 

Once  preprocessing  is  completed,  we  are  in  a  position  to  carry  out  the  necessary  commu¬ 
nication  and  computation,  Figure  5  outlines  the  steps  involved.  The  initial  data  exchange 
phase  follows  the  plan  established  by  the  inspector.  During  preprocessing,  each  proces¬ 
sor  finds  out  which  distributed  array  elements  need  to  be  transmitted.  When  a  processor 
obtains  copies  of  off-processor  distributed  array  elements,  the  copies  are  written  into  the 
processor’s  hashed  cache.  Once  the  communication  phase  is  over,  each  processor  carries  out 
its  computation.  Each  processor  uses  locally  stored  portions  of  distributed  arrays  along  with 
off-processor  distributed  array  elements  stored  in  the  hashed  cache.  When  the  computational 
phase  is  finished,  distributed  array  elements  to  be  stored  off-processor  are  obtained  from  the 
hashed  cache  and  sent  to  the  appropriate  off-processor  locations. 

There  are  many  situations  in  which  simple,  easily  specified  distributed  array  partitions 
are  inappropriate.  For  instance  when  we  compute  using  an  unstructured  mesh,  we  attempt 
to  partition  the  problem  so  that  each  processor  performs  approximately  the  same  amount  of 
work  and  so  that  the  communications  overhead  is  minimized.  Typically,  it  is  not  possible  to 
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Each  processor  P: 


—  preprocesses  its  own  loop  iterations 

—  Records  off-processor  fetches  and  stores  in  hashed  cache 

—  Consults  distributed  translation  table  to 

*  Find  location  in  distributed  memory  for  each  off- processor  fetch  or  store 

—  Finds  send/receive  calls  required  for  data  exchange 

(i)  P  generates  list  of  all  off-processor  data  to  be  fetched 

(ii)  Other  processors  tell  P  which  data  to  send 

(iii)  Send/Receive  pairs  generated  and  stored 


Figure  6:  Inspector  For  Parallel  Loop  Using  Irregular  Distributed  Array  Mapping 

express  the  resulting  array  partitions  in  a  simple  way.  If  we  allow  an  arbitrary  assignment  of 
distributed  array  elements  to  processors,  the  data  structure  used  to  describe  the  partitioning 
will  have  the  same  number  of  elements  as  the  distributed  array. 

In  order  to  access  an  array  element,  we  need  to  know  where  the  element  is  stored  in 
the  memory  of  the  distributed  machine.  We  use  a  distributed  translation  table  defined  by 
a  partitioning  algorithm,  to  describe  the  mapping.  When  a  distributed  translation  table 
is  used  to  describe  array  mappings,  inspectors  must  be  modified  so  that  they  access  the 
distributed  table.  Using  an  irregular  array  mapping  does  not  alter  the  form  of  the  executor. 
The  modifications  to  be  made  to  an  inspector  are  outlined  in  Figure  6. 

4.2  Languages  and  Tools  for  Irregular  Problems 

Programs  designed  to  carry  out  sparse  direct  and  iterative  methods  also  typically  require 
many  of  the  optimizations  described  in  Section  4.1.  Some  examples  of  such  programs  are 
described  in  [3],  [18],  [15],  [4].  Williams  [34]  describes  a  programming  environment  (DIME) 
for  calculations  with  unstructured  triangular  meshes  using  distributed  memory  machines.  In 
[34],  collections  of  distributed  array  accesses  are  translated  into  an  efficient  set  of  inter-node 
messages.  The  DIME  programming  environment  embodies  many  of  the  principles  discussed 
in  Section  4.1.  The  optimizations  discussed  in  the  last  section  can  be  incorporated  into 
distributed  compilers.  Runtime  compilation  for  distributed  machines  was  proposed  in  [25]; 
this  description  was  in  the  context  of  the  Crystal  language.  Distributed  memory  runtime 
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compilation  was  expanded  upon  in  [20];  which  outlines  the  principals  behind  the  PARTI 
project.  A  more  detailed  description  of  the  concepts  behind  distributed  memory  runtime 
compilation  is  found  in  [26],  and  [21].  The  idea  of  splitting  a  loop  into  an  inspector  and 
executor  and  integrating  this  into  a  compiler  was  also  developed  independently  as  part  of 
the  KALI  project  [17].  Other  compiler  projects  have  also  proposed  run  time  resolution  of 
communications  on  distributed  machines  [8],  [22],  [23],  These  compilers  do  not  carry  out 
the  kinds  of  run  time  optimizations  of  the  sort  described  here. 

We  have  designed  a  set  of  procedures  or  primitives  that  do  the  work  needed  to  implement 
inspectors  and  executors.  We  have  also  designed  and  implemented  a  model  compiler  that 
recognizes  a  subset  of  Fortran  (ARF  -  ARguably  Fortran)  and  generates  inspector  and  ex¬ 
ecutor  loops  with  embedded  primitives.  Distributed  arrays  can  be  declared  in  YRF  source. 
These  distributed  arrays  can  either  be  partitioned  between  processors  in  a  uniform  manner 
(e.g.  equal  sized  blocks  of  contiguous  array  elements  assigned  to  each  processor),  alternately, 
distributed  arrays  can  be  partitioned  in  an  irregular  manner.  When  an  array  is  to  be  par¬ 
titioned  in  an  irregular  fashion,  mapping  information  is  specified  in  an  integer  array.  This 
integer  array  is  typically  produced  by  a  partitioning  procedure.  Element  i  of  the  integer  ar¬ 
ray  describes  the  processor  to  which  element  i  of  the  distributed  array  is  to  be  mapped.  For 
example,  consider  the  ARF  declaration,  distributed  irregular  using  map  real  y(4). 
This  declaration  denotes  a  four  element  real  array  y  that  is  to  be  distributed  according  to 
integer  array  map. 

Embedded  primitives  include  communications  procedures  designed  to  support  irregular 
patterns  of  distributed  array  access.  Other  primitives  that  involve  interprocessor  commu¬ 
nication  initialize  distributed  translation  tables  or  access  distributed  translation  tables  to 
find  the  location  of  irregularly  mapped  distributed  array  data.  Primitives  also  support  the 
maintenance  of  hashed  caches.  (Recall  from  Section  4.1  that  hashed  caches  store  copies  of 
off-processor  distributed  array  data.)  There  are  also  PARTI  primitives  that  perform  accu¬ 
mulations  to  off-processor  distributed  array  elements. 

In  Figure  7  we  present  a  simple  example  of  an  ARF  program.  The  procedure  to  be 
presented  is  a  block  sparse  matrix  vector  multiply,  obtained  from  an  iterative  solver  nroduced 
for  a  program  designed  to  calculate  fluid  flow  for  geometries  defined  by  an  unstructured  mesh 
[31].  The  ARF  compiler  uses  information  in  integer  array  row  to  make  calls  to  primitives 
that  initialize  the  distributed  translation  tables.  These  distributed  translation  tables  are 
used  to  describe  the  mapping  of  x,  y  ,  cols  and  ncols  (statements  Si  and  S2).  The 
current  version  of  ARF  distributes  only  the  last  declared  dimension  in  a  multidimensional 
array,  although  the  PARTI  primitives  do  support  a  broader  class  of  array  mappings  [6]. 
As  of  the  time  of  writing,  the  ARF  compiler  does  not  include  syntax  that  specifies  where 
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computational  work  is  to  be  performed.  Partitioning  procedures  specify  where  work  is  to  be 
carried  out,  but  the  interface  between  partitioning  procedures  and  ARF  has  not  yet  been 
automated. 

In  Figure  7,  array  x  is  indexed  by  m  and  cols(j  ,i).  If  we  were  to  carry  out  the  work 
in  this  loop  in  a  naive  manner,  we  would  have  to  fetch  each  individual  distributed  array 
element  x(m,cols(j  ,i))  (statement  S4)  from  its  assigned  processor.  Since  the  processor 
assignments  of  elements  of  x  are  stored  in  a  distributed  translation  table,  we  would  also  need 
to  access  the  memory  of  the  processor  that  keeps  track  of  where  x(m,cols(j  ,i))  is  stored. 

In  Table  1,  we  present  the  execution  times  on  32  processor  and  64  processor  Intel 
iPSC/860  machines,  obtained  from  the  block  matrix  vector  multiply  kernel  as  well  as  exe¬ 
cution  times  from  another,  more  complex  kernel  that  arose  in  an  unstructured  code.  This 
kernel,  to  be  referred  to  here  as  fluxroe,  computes  convective  fluxes  using  a  method  based  on 
Roe’s  approximate  Riemann  solver  [32],  [33];  the  kernel  is  discussed  in  some  detail  in  [6]. 
Both  the  block  matrix  vector  multiply  and  the  fluxroe  kernel  arise  from  iterative  algorithms. 
In  these  tests,  fluxroe  was  translated  into  ARF  and  compiled.  In  these  experiments,  we  used 
two  different  unstructured  meshes: 

(i)  A  21,672  element  mesh  generated  to  carry  out  an  aerodynamic  simulation  involving  a 
multielement  airfoil  in  a  landing  configuration  [19] 

(ii)  A  37,741  element  mesh  generated  to  simulate  a  4.2  %  circular  arc  airfoil  in  a  channel 

[14]. 

In  all  the  cases  presented  below,  each  unstructured  mesh  was  partitioned  by  recursive 
orthogonal  dissection  [13]. 

In  table  1  we  present: 

inspector  time  -  time  required  to  carry  out  the  inspector  preprocessing  phase 

computation  time  -  the  time  required  to  perform  computations  in  the  iterative  portion 
of  the  program 

communication  time  -  the  time  required  to  exchange  messages  within  the  iterative 
portion  of  the  program. 

The  inspector  time  includes  the  time  required  to  set  up  the  needed  distributed  translation 
table  as  well  as  the  time  required  to  access  the  distributed  translation  table  when  carrying 
out  the  preprocessing  in  the  inspector.  In  these  experiments,  the  ratio  of  the  time  required  to 
carry  out  the  inspector  to  the  computation  time  required  for  a  single  iteration  ranged  from  a 
factor  of  3  to  a  factor  of  5.  Most  of  the  preprocessing  time  goes  to  setting  up  and  using  the 
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distributed  translation  table.  For  instance,  consider  the  block  matrix  vector  multiply  on  64 
processors  using  the  21,672  element  mesh.  The  total  preprocessing  cost  was  122  milliseconds, 
of  which  111  milliseconds  went  to  translation  table  related  work. 

We  can  define  parallel  efficiency  for  a  given  number  of  processors  P  as  the  sequential 
time  divided  by  the  product  of  the  execution  time  on  P  processors  times  P.  In  Table  1  we 
depict  under  the  heading  of  single  sweep  efficiency ,  the  parallel  efficiencies  we  would  obtain 
were  we  required  to  preprocess  the  kernel  each  time  we  carried  out  calculations.  In  reality, 
preprocessing  time  can  be  amortized  over  multiple  mesh  sweeps.  If  we  neglect  the  time 
required  to  preprocess  the  problem  in  computing  parallel  efficiencies,  we  obtain  the  second 
set  of  parallel  efficiency  measurements,  the  amortized  efficiency  presented  in  Table  1.  The 
amortized  efficiencies  for  64  processors  ranged  from  0.48  to  0.59,  while  the  single  sweep 
efficiencies  ranged  from  0.10  to  0.17. 

In  the  experiments  depicted  in  Table  1,  the  time  spent  computing  is  at  least  a  factor  of  2 
greater  than  the  communication  time.  The  amortized  efficiencies  are,  however,  impacted  by 
the  fact  that  the  computations  in  the  parallelized  codes  are  carried  out  less  efficiently  than 
those  in  the  sequential  program.  The  parallel  code  spends  time  accessing  the  hashed  cache. 
It  also  needs  to  perform  more  indirections  thp,n  the  sequential  program. 


Table  1:  Performance  on  different  number  of  processors 
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51  distributed  irregular  using  row  real  x(4,n),  y(4,n) 

52  distributed  irregular  using  row  integer  cols(9,n),  ncols(n) 

...  initialization  of  local  variables  ... 

doall  i=l,n 

do  j=l,ncols(i) 

S3  do  k=l,4 

sum  =  0 
do  m  =  1,4 

S4  sum  =  sum  +  f(i,m,k,i)*x(m,cols(j,i)) 
enddo 

y(k,i)  =  y(k,i)  +  sum 
enddo 

enddo 

enddo 

Figure  7:  ARF  Kernel  From  Unstructured  CFD  Code 
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4.3  Future  Optimizations 

Most  of  the  optimizations  described  in  this  section  are  motivated  either  directly  or  indirectly 
by  the  high  communication  latencies  typically  found  in  distributed  memory  computers.  Be¬ 
cause  we  can  anticipate  all  of  the  interprocessor  communications  that  will  be  needed  in 
carrying  out  a  loop,  we  have  the  information  we  need  to  schedule  interprocessor  communi¬ 
cations  to  reduce  overheads  due  to  contention.  As  we  shall  see  in  Section  5,  scheduling  of 
interprocessor  communication  has  already  been  shown  to  be  an  important  optimization  for 
some  SIMD  architectures.  We  expect  this  to  also  turn  out  to  be  a  fruitful  optimization  for 
distributed  memory  MIMD  computers. 

Computations  can  be  characterized  by  patterns  of  data  dependency.  Procedures  that  par¬ 
tition  data  structures  and  computational  work  take  these  dependency  patterns  into  account. 
It  is  possible  to  design  program  transformations  that  generate  procedures  which  output  a 
record  of  the  dependency  patterns  in  a  loop  nest  in  a  a  standard  representation  [20].  Stan¬ 
dardized  partitioning  programs  that  use  these  data  structures  can  then  be  employed. 

5  Runtime  Compilation  in  SIMD  Machines  -  the  Com¬ 
munications  Compiler 

Irregular  problems  can  cause  serious  performance  degradation  on  the  CM-2  [5].  It  turns 
out  that  this  performance  degradation  can  be  ameliorated  by  a  form  of  runtime  compila¬ 
tion.  Denning  Dahl  has  developed  a  set  of  software  facilities  for  the  Connection  Machine 
(CM-2)  that  are  designed  to  handle  applications  that  exhibit  fixed  irregular  patterns  of 
communication  [10].  One  procedure,  the  communications  compiler  schedules  interprocessor 
communications.  The  other  procedure  a  mapping  facility  maps  graphs  generated  from  a 
communication  pattern  onto  the  CM-2.  In  this  paper,  we  will  focus  our  attention  on  the 
communications  compiler. 

The  communications  compiler  decomposes  an  irregular  communications  pattern  into  a 
sequence  of  simple,  inexpensive  data  transfers.  These  data  transfers  make  use  of  the  hy¬ 
percube  communication  network  in  the  CM-2.  In  the  CM-2,  all  links  of  the  hypercube 
can  simultaneously  carry  bidirectional  information.  The  communications  compiler  attempts 
to  reduce  time  required  for  communication  by  the  judicious  scheduling  of  messages.  At 
present,  the  communications  compiler  is  accessed  by  procedure  calls.  Lists  of  destination 
addresses  are  passed  to  the  communication  compiler’s  preprocessor  procedures.  Once  the 
preprocessing  is  completed,  a  data  delivery  function  carries  out  the  scheduled  communica¬ 
tions.  Transformations  analogous  to  those  discussed  in  Section  4.2  could  be  used  to  embed 
these  communication  compiler  primitives  into  programs,  and  hence  to  generate  inspectors 
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and  executors. 

We  present  a  set  of  benchmarks  that  quantifies  the  performance  effects  of  the  commu¬ 
nications  compiler.  A  synthetic  workload  was  defined  in  the  following  way.  A  square  mesh 
in  which  each  point  was  linked  to  four  nearest  neighbors  was  incrementally  distorted.  Ran¬ 
dom  edges  were  introduced  subject  to  the  constraint  that  in  the  new  mesh,  each  point  still 
required  information  from  four  other  mesh  points. 

The  following  assumptions  are  inherent  in  our  workload  generator:  makes 

(i)  The  problem  domain  consists  of  a  64  by  128  mesh  of  points  which  are  numbered  using 
their  row  major  or  natural  ordering; 

(ii)  Each  point  is  initially  connected  to  its  four  nearest  neighbors 

(iii)  Each  link  produced  in  the  above  step  is  examined,  with  probability  q  the  link  is  replaced 
by  a  link  to  a  randomly  chosen  point. 

An  8192  processor  Connection  Machine-2  was  configured  as  a  64  by  128  torus.  The 
mesh  was  mapped  onto  the  torus  in  the  obvious  manner.  A  sweep  over  the  mesh  was  then 
performed  using  the  following  communication  mechanisms: 

(i)  Get:  The  standard  CM-2  general  router  is  called  four  times,  once  for  each  of  the  four 
off-processor  data  elements  needed  by  each  processor. 

(ii)  Compiled  get:  Communications  compiled  using  the  communications  compiler;  the  com¬ 
munications  compiler  preprocessor  was  called  four  times,  once  for  each  of  the  four  off- 
processor  data  elements  required  by  each  processor.  The  data  delivery  procedure  is 
called  four  times  during  each  mesh  sweep. 

(iii)  Compiled  gather:  Communications  compiled  using  the  communications  compiler;  a 
single  call  to  the  communications  compiler  preprocessor  handles  each  processor’s  four 
data  requests.  For  each  iteration,  a  single  data  delivery  function  carries  out  all  com¬ 
munication. 

(iv)  NEWS:  CM-2  communications  procedures  that  transmit  information  using  mesh  em¬ 
bedded  into  hypercube  by  binary  reflected  gray  code.  NEWS  was  only  used  to  bench¬ 
mark  the  completely  uniform  mesh  (q  =  0  ). 

The  construction  of  the  communication  schedule  took  anywhere  from  1  to  13  seconds. 

The  results  of  these  benchmarks  are  depicted  in  Figure  8.  In  these  experiments  we 
carried  out  sweeps  over  meshes  generated  by  varying  q  from  0.0  to  0.5.  For  the  uniform  mesh 
(q=0),  we  used  all  four  communications  mechanisms  described  above.  For  the  synthetically 
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generated  irregular  meshes,  we  used  the  standard  CM  get, the  compiled  get  and  the  compiled 
gather.  Let  Thews  represent  the  the  time  required  by  the  CM-2  to  to  sweep  over  a  regular 
mesh  (q=0)  using  the  NEWS  mechanism;  Thews  was  equal  to  0.80  milliseconds.  In  Figure 
8,  we  compare  Thews  with  the  time  taken  by  the  CM-2  to  sweep  over  irregular  meshes 
using  the  standard  CM  get  ( Tget),  the  compiled  get  ( Tcget )  and  the  compiled  gather 
( Tcgather )•  For  the  regular  mesh,  Tget ,  Tcget ,  and  Tcgather  were  factors  of  15.4,  2.2 
and  1.1  times  larger  than  Thews-  As  q  increased,  the  performance  of  the  mesh  sweep 
degraded  significantly  with  all  three  routing  mechanisms  tested.  For  q  =  0.5,  TGet,  Tcget > 
and  Tcgather  were  factors  of  22.6,  4.4  and  2.7  times  larger  than  Thews-  It  is  clear  that 
runtime  compilation  techniques  can  play  an  important  role  in  reducing  communications 
costs  for  irregular  problems  on  SIMD  machines.  The  computational  cost  of  the  simulated 
annealing  based  communications  compiler  is,  however,  extremely  high. 

6  Conclusions 

Execution  time  preprocessing  plays  a  major  role  in  many  efficient  algorithms  in  computer 
science.  Runtime  preprocessing  also  plays  an  important  role  in  exploiting  multiprocessor 
architectures.  Examples  of  such  preprocessing  include  runtime  parallelization,  runtime  ag¬ 
gregation  and  scheduling  of  remote  distributed  array  accesses  and  execution  time  data  and 
workload  partitioning.  We  have  given  examples  of  how  optimizations  of  this  type  can  be 
integrated  into  compilers.  We  have  also  presented  specific  benchmarks  that  document,  on  a 
range  of  multiprocessor  architectures,  the  importance  of  various  types  of  runtime  compila¬ 
tion. 
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